In this section we will be working with shapefiles. More specifically how to read in a shapefile and join this to our aggregated crime count data frame. From there we introduce classification methods as a way to better visualize crime counts.
As always the first step is to load the necessary R packages via the library function. If you do not have these packages installed then please follow the instructions in the Preliminary Task.Rmd file.
# for data reading/manipulation
library(dplyr)
library(tidyr)
library(readr)
library(tibble)
library(janitor)
# for spatial data and gis
library(sf)
library(ggmap)
library(ggplot2)
library(ggspatial)
library(spdep)
library(leaflet)
library(RColorBrewer)
library(tmap)
They represent a geospatial vector that is used for GIS software. Shapefiles store both geogrpahic location and its associated attribute information
The Shapefile format stores the data as primitive geometric shapes like points, lines, and polygons. These shapes, together with data attributes that are linked to each shape, create the representation of the geographic data.
They contain four mandatory file extensions (.shx, .shp, .dbf and the .prj). - The .shp contains the geometry data (a 2D axis ordering of coordinate data) - The .shx contains the positional index of the feature geometry - The .dbf contins the attributes for each shape - The .prj contains the cs and projection information
In criminological research, the LSOA is quite frequently used as the main census geography
I collected my shapefile data via the UKDS Census Support (https://borders.ukdataservice.ac.uk/bds.html). If you want specific information about how to use the website, please refer to the Downloading the data document again
shp_file <- sf::st_read("/Users/shihaitao/Library/Mobile Documents/com~apple~CloudDocs/Documents/M1_Documents/爱丁堡大学培训/MappingCrimeData/MappingCrimeData/Data/Shapefile/england_lsoa_2021.shp")
Reading layer `england_lsoa_2021' from data source
`/Users/shihaitao/Library/Mobile Documents/com~apple~CloudDocs/Documents/M1_Documents/爱丁堡大学培训/MappingCrimeData/MappingCrimeData/Data/Shapefile/england_lsoa_2021.shp'
using driver `ESRI Shapefile'
Simple feature collection with 55 features and 4 fields
Geometry type: MULTIPOLYGON
Dimension: XY
Bounding box: xmin: 485406.9 ymin: 154122.5 xmax: 501181.2 ymax: 166842.9
Projected CRS: OSGB36 / British National Grid
You can also use the head() function in shapefiles!
head(shp_file)
Simple feature collection with 6 features and 4 fields
Geometry type: MULTIPOLYGON
Dimension: XY
Bounding box: xmin: 487591.7 ymin: 156900.2 xmax: 500937.4 ymax: 166842.9
Projected CRS: OSGB36 / British National Grid
name label lsoa21nm lsoa21cd geometry
1 Surrey Heath 001A E92000001E12000008E07000214E02006416E01030763 Surrey Heath 001A E01030763 MULTIPOLYGON (((496838.5 16...
2 Surrey Heath 006B E92000001E12000008E07000214E02006421E01030762 Surrey Heath 006B E01030762 MULTIPOLYGON (((494998.5 16...
3 Surrey Heath 007D E92000001E12000008E07000214E02006422E01030777 Surrey Heath 007D E01030777 MULTIPOLYGON (((490461 1601...
4 Surrey Heath 007C E92000001E12000008E07000214E02006422E01030776 Surrey Heath 007C E01030776 MULTIPOLYGON (((490895.1 16...
5 Surrey Heath 001D E92000001E12000008E07000214E02006416E01030809 Surrey Heath 001D E01030809 MULTIPOLYGON (((493209 1648...
6 Surrey Heath 011C E92000001E12000008E07000214E02006426E01030772 Surrey Heath 011C E01030772 MULTIPOLYGON (((488055.3 15...
#or use 'View(shp_file) to view the full dataset which will open up in a new panel
Further Information - To clarify, this is an ‘empty shapefile’, it simply contains the boundary profile of Surrey Heath and does not contain any further attribute information. However, if it did contain further attribute information such as the crime counts, population statistics, IMD counts, then you would not need to join the data as we do in this workshop, but instead you could layer the shapefile over our simple feature object created in Topic 1. More information on this method is available from the workshop that was held in February - all resources are available via the ‘Feb_2021’ folder from the github link [https://github.com/UKDataServiceOpen/Crime_Data_in_R.git]
Lets plot the empty shapefile for Surrey Heath to see what we’re actually looking at.
## Plot the Shapefile
ggplot() +
geom_sf(data = shp_file)
As you can seem, we have a map that details the borders (i.e. shape) between each LSOA in Surrey.
This workshop instead joins the ‘crime’ datatset (in tibble format) to the above shapefile. Our newly created object ‘shp_file’ is in fact a sf object, which is short for a ‘simple feature object’. You can check this by typing class(shp_file).
class(shp_file)
[1] "sf" "data.frame"
So in total, the shapefile consits of 5 variables. The first 4 variables indicate information about that specific LSOA, we are given the name, LSOA code and LSOA name. We can ignore the column ‘label’ as this is just another reference point.
The column I want to draw attention to is the ‘geometry column’
attributes(shp_file$geometry)
$n_empty
[1] 0
$crs
Coordinate Reference System:
User input: OSGB36 / British National Grid
wkt:
PROJCRS["OSGB36 / British National Grid",
BASEGEOGCRS["OSGB36",
DATUM["Ordnance Survey of Great Britain 1936",
ELLIPSOID["Airy 1830",6377563.396,299.3249646,
LENGTHUNIT["metre",1]]],
PRIMEM["Greenwich",0,
ANGLEUNIT["degree",0.0174532925199433]],
ID["EPSG",4277]],
CONVERSION["British National Grid",
METHOD["Transverse Mercator",
ID["EPSG",9807]],
PARAMETER["Latitude of natural origin",49,
ANGLEUNIT["degree",0.0174532925199433],
ID["EPSG",8801]],
PARAMETER["Longitude of natural origin",-2,
ANGLEUNIT["degree",0.0174532925199433],
ID["EPSG",8802]],
PARAMETER["Scale factor at natural origin",0.9996012717,
SCALEUNIT["unity",1],
ID["EPSG",8805]],
PARAMETER["False easting",400000,
LENGTHUNIT["metre",1],
ID["EPSG",8806]],
PARAMETER["False northing",-100000,
LENGTHUNIT["metre",1],
ID["EPSG",8807]]],
CS[Cartesian,2],
AXIS["(E)",east,
ORDER[1],
LENGTHUNIT["metre",1]],
AXIS["(N)",north,
ORDER[2],
LENGTHUNIT["metre",1]],
USAGE[
SCOPE["Engineering survey, topographic mapping."],
AREA["United Kingdom (UK) - offshore to boundary of UKCS within 49°45'N to 61°N and 9°W to 2°E; onshore Great Britain (England, Wales and Scotland). Isle of Man onshore."],
BBOX[49.75,-9,61.01,2.01]],
ID["EPSG",27700]]
$class
[1] "sfc_MULTIPOLYGON" "sfc"
$precision
[1] 0
$bbox
xmin ymin xmax ymax
485406.9 154122.5 501181.2 166842.9
The geometry column can be split into two key sections; the feature and the geometry - The feature in this case is our polygon level (referenced by the multipolygon) which is in fact a simple feature geometery list- column (sfc) - The geometery are the numbers that follow, and more technically known as a ’simple feature geometry (sfg)
The column in the sf data.frame that contains the geometries is a list, of class sfc. We can retrieve the geometry list-column in this case by using st_geometry.
st_geometry(shp_file)
Geometry set for 55 features
Geometry type: MULTIPOLYGON
Dimension: XY
Bounding box: xmin: 485406.9 ymin: 154122.5 xmax: 501181.2 ymax: 166842.9
Projected CRS: OSGB36 / British National Grid
First 5 geometries:
MULTIPOLYGON (((496838.5 166608.5, 496888 16656...
MULTIPOLYGON (((494998.5 160258.5, 494999.7 160...
MULTIPOLYGON (((490461 160153.2, 490461.6 16015...
MULTIPOLYGON (((490895.1 160221.4, 490913.8 160...
MULTIPOLYGON (((493209 164876, 493221 164874, 4...
Now we have a basic understanding of what a shapefile is and how we can import them into r, the next step is run some data manipulation and create some new dataframes that can work with the format of shapefiles.
The original crime data set contains the individual count of reported crime types across LSOAS, therefore the LSOAs are repeated multiple times. This is because you would expect to see multiple crime counts in one LSOA.
In order to highlight how many crimes have occurred in each LSOA, we can count the crimes per LSOA and obtained grouped statistics.
crimes_grouped_by_lsoa <- crime %>%
group_by(lsoa_code) %>%
summarise(count=n())
head(crimes_grouped_by_lsoa)
In our new object you will see two variables, the LSOA and the count of crime in each one.
We can now join the Shapefile (the geospatial vector) and the crimes_grouped_by_losa (the aggregated data)
To join the crimes per lsoa to the shapefile we can use the left_join function that returns all the rows of the table on the left side of the join and matching rows for the table on the right side of join.
surrey_lsoa <- left_join(shp_file, crimes_grouped_by_lsoa, by = c("lsoa21cd" = "lsoa_code"))
head(surrey_lsoa)
Simple feature collection with 6 features and 5 fields
Geometry type: MULTIPOLYGON
Dimension: XY
Bounding box: xmin: 487591.7 ymin: 156900.2 xmax: 500937.4 ymax: 166842.9
Projected CRS: OSGB36 / British National Grid
name label lsoa21nm lsoa21cd count
1 Surrey Heath 001A E92000001E12000008E07000214E02006416E01030763 Surrey Heath 001A E01030763 12
2 Surrey Heath 006B E92000001E12000008E07000214E02006421E01030762 Surrey Heath 006B E01030762 6
3 Surrey Heath 007D E92000001E12000008E07000214E02006422E01030777 Surrey Heath 007D E01030777 3
4 Surrey Heath 007C E92000001E12000008E07000214E02006422E01030776 Surrey Heath 007C E01030776 3
5 Surrey Heath 001D E92000001E12000008E07000214E02006416E01030809 Surrey Heath 001D E01030809 3
6 Surrey Heath 011C E92000001E12000008E07000214E02006426E01030772 Surrey Heath 011C E01030772 2
geometry
1 MULTIPOLYGON (((496838.5 16...
2 MULTIPOLYGON (((494998.5 16...
3 MULTIPOLYGON (((490461 1601...
4 MULTIPOLYGON (((490895.1 16...
5 MULTIPOLYGON (((493209 1648...
6 MULTIPOLYGON (((488055.3 15...
st_geometry_type(surrey_lsoa) #view the geometery type
[1] MULTIPOLYGON MULTIPOLYGON MULTIPOLYGON MULTIPOLYGON MULTIPOLYGON MULTIPOLYGON MULTIPOLYGON MULTIPOLYGON MULTIPOLYGON
[10] MULTIPOLYGON MULTIPOLYGON MULTIPOLYGON MULTIPOLYGON MULTIPOLYGON MULTIPOLYGON MULTIPOLYGON MULTIPOLYGON MULTIPOLYGON
[19] MULTIPOLYGON MULTIPOLYGON MULTIPOLYGON MULTIPOLYGON MULTIPOLYGON MULTIPOLYGON MULTIPOLYGON MULTIPOLYGON MULTIPOLYGON
[28] MULTIPOLYGON MULTIPOLYGON MULTIPOLYGON MULTIPOLYGON MULTIPOLYGON MULTIPOLYGON MULTIPOLYGON MULTIPOLYGON MULTIPOLYGON
[37] MULTIPOLYGON MULTIPOLYGON MULTIPOLYGON MULTIPOLYGON MULTIPOLYGON MULTIPOLYGON MULTIPOLYGON MULTIPOLYGON MULTIPOLYGON
[46] MULTIPOLYGON MULTIPOLYGON MULTIPOLYGON MULTIPOLYGON MULTIPOLYGON MULTIPOLYGON MULTIPOLYGON MULTIPOLYGON MULTIPOLYGON
[55] MULTIPOLYGON
18 Levels: GEOMETRY POINT LINESTRING POLYGON MULTIPOINT MULTILINESTRING MULTIPOLYGON GEOMETRYCOLLECTION ... TRIANGLE
st_bbox(surrey_lsoa) #obtains the objects value as specific units
xmin ymin xmax ymax
485406.9 154122.5 501181.2 166842.9
#The spatial extent of a shapefile or R spatial object represents the geographic “edge” or location that is the #furthest north, south east and west. Thus is represents the overall geographic coverage of the spatial object.
Now lets map the new data!
#map the data
ggplot() +
annotation_map_tile() +
geom_sf(data = surrey_lsoa, aes(fill = count), alpha = 0.5) +
scale_fill_gradient2(name ="Number of crimes")
|
| | 0%
|
|============================== | 25%
|
|============================================================ | 50%
|
|========================================================================================= | 75%
|
|=======================================================================================================================| 100%
The tmap package allows you tp create thematic maps, the syntax is very similar to the ggplot2. Each map can be plot as an image or as an interactive map via the tmap_mode(“view” / “plot” ) function.
tmap_mode("view")
tmap mode set to interactive viewing
tm_shape(surrey_lsoa) +
tm_fill("count") +
tm_borders("green", lwd = 0.7, alpha = 0.5)
#tm_text("name", size = "AREA", col = "black")
#tmap_style("col_blind")
How can we better visualise counts? Count data does not equally represent the population distribution at hand, tmaps allows you to alter the characteristics of theamatic maps via the ‘styles’ function. The different styles result in different binning techniques. Now, when mapping quantitiatve data such as crime counts, typicaly the variables needed to be put into to ‘bins’. As seen in the previous example, the default binning applied to highlight the LSOAs grouped started from 1-10, 11-20, 21-20, 31-40, 41-50 and 51-60 crimes.
These bins were decided on automatically, however we can define more accurate classes that best refelct the distributional character of the data set.
In this example I’ve used the “kmeans”, “jenks” and “sd”.
a <- tm_shape(surrey_lsoa) +
tm_fill("count", style = "kmeans") +
tm_borders(alpha = 0.3)
b <- tm_shape(surrey_lsoa) +
tm_fill("count", style = "jenks") +
tm_borders(alpha = 0.3)
c <- tm_shape(surrey_lsoa) +
tm_fill("count", style = "sd") +
tm_borders(alpha = 0.3)
## tmap_arrange
tmap_mode("plot")
tmap mode set to plotting
tmap_arrange(a, b, c)
NA
NA
Just like the tmap_arrange function, tmap_facets are way to produce side-by-side maps (known as small multiples). It is similar to the ‘facet_grid’ function in ggplot2
Following the rdocumentation [https://www.rdocumentation.org/packages/tmap/versions/3.3-2/topics/tm_facets] “Small multiples can be created in two ways: 1) by specifying the by argument with one or two variable names, by which the data is grouped, 2) by specifying multiple variable names in any of the aesthetic argument of the layer functions (for instance, the argument col in tm_fill).”
Typically tm_facets are defined by a categorical variables. For example in this example, I am using tm_facets() to seperate the map into multiple components by lsoa (This isnt’t the best example of a categorical variable, but something like the urban or rural landscape, or deprivation decile, would be more of interest in criminology). You could for example download the 2011 rural/urban classification from open geography portal and join this to our ‘surrey_lsoa’ sf object (using left_join).
If you are interested in doing so the dataset can be found here; [https://www.ons.gov.uk/methodology/geography/geographicalproducts/ruralurbanclassifications/2011ruralurbanclassification]
tm_shape(surrey_lsoa) +
tm_fill("count",
style = "quantile",
palette = "Blues",
thres.poly = 0) +
tm_facets(by="name",
free.coords=TRUE,
drop.shapes=TRUE) +
tm_layout(legend.show = FALSE,
title.position = c("center", "center"),
title.size = 20) +
tm_borders(alpha = 0.5)
tm_shape(surrey_lsoa) +
tm_fill("count", style = "sd") +
tm_borders(alpha = 0.3) +
tmap_style("col_blind")
tm_shape(surrey_lsoa)+
tm_fill("count",
style = "quantile",
palette = "Blues",
legend.hist = TRUE,
legend.is.portrait = TRUE,
legend.hist.z = 0.1) +
tm_layout(legend.height = 0.45,
legend.width = 0.35,
legend.outside = FALSE,
legend.position = c("right", "bottom"),
frame = FALSE) +
tm_borders(alpha = 0.5)
tm_shape(surrey_lsoa)+
tm_fill("count",
style = "quantile",
palette = "Blues",
legend.hist = TRUE,
legend.is.portrait = TRUE,
legend.hist.z = 0.1) +
tm_layout(legend.height = 0.45,
legend.width = 0.35,
legend.outside = FALSE,
legend.position = c("right", "bottom"),
frame = FALSE) +
tm_borders(alpha = 0.5) +
tm_compass(type="8star", size = 2) + #compass
tm_scale_bar(width = 0.15) + #scale bar
tm_grid() #grid
Explore some of the different classification methods such as “bclust” and “hclust” - what are the main differences? To get help on the different methods available use ??tmap-package or search in the help tab
Assign your new bclust and hclust classification maps into separate objects (call them “h” and “b” and plot them together using tmap_arrange()
Plot an interactive map using the “bclust” classification method by changing the command in the tmap_mode() function
#1)
h <- tm_shape(surrey_lsoa) +
tm_fill("count", style = "hclust") +
tm_borders(alpha = 0.3)
b <- tm_shape(surrey_lsoa) +
tm_fill("count", style = "bclust") +
tm_borders(alpha = 0.3)
#2)
tmap_arrange(h,b)
Committee Member: 1(1)(2)(3)(4)(5)(6)(7)(8)(9)(10)(11)(12)(13)(14)(15)(16)(17)(18)(19)(20)
Error in bclust(x = var, centers = n, ...) :
Could not find valid cluster solution in 20 replications
Committee Member: 1(1) 2(1) 3(1) 4(1) 5(1) 6(1) 7(1) 8(1) 9(1) 10(1)
Computing Hierarchical Clustering
Committee Member: 1(1)(2)(3)(4)(5)(6)(7)(8)(9)(10)(11)(12)(13)(14)(15)(16)(17)(18)(19)(20)
Error in bclust(x = var, centers = n, ...) :
Could not find valid cluster solution in 20 replications
Committee Member: 1(1) 2(1) 3(1) 4(1) 5(1) 6(1) 7(1) 8(1) 9(1) 10(1)
Computing Hierarchical Clustering
#3)
tmap_mode("view")
tmap mode set to interactive viewing
tm_shape(surrey_lsoa) +
tm_fill("count", style = "bclust") +
tm_borders(alpha = 0.3)
Committee Member: 1(1)(2)(3)(4)(5)(6)(7)(8)(9)(10)(11)(12)(13)(14)(15)(16)(17)(18)(19)(20)
Error in bclust(x = var, centers = n, ...) :
Could not find valid cluster solution in 20 replications
Committee Member: 1(1) 2(1) 3(1) 4(1) 5(1) 6(1) 7(1) 8(1) 9(1) 10(1)
Computing Hierarchical Clustering